The goal of punk is to make available simple wrappers for a variety of machine learning pipelines. The pipelines are termed primitives, and each primitive is designed with a functional programming approach in mind. At the time of this writing, punk is updated periodically: any new primitives are released as a pip-installable Python package every Friday, along with their corresponding annotation files for the broader D3M community.
Below we briefly show how the primitives in the punk package can be used.
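If you want to follow along, the package can be installed from within the notebook. This cell is only a sketch and assumes the package is published on PyPI under the name punk; adjust the install line for your own index or version pin.
In [ ]:
# Install the punk package (assumed PyPI name; adjust as needed)
!pip install punk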
In [1]:
import punk
help(punk)
In [2]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from punk import feature_selection
In [3]:
# Wine dataset
df_wine = pd.read_csv('https://raw.githubusercontent.com/rasbt/'
                      'python-machine-learning-book/master/code/datasets/wine/wine.data',
                      header=None)
columns = np.array(['Alcohol', 'Malic acid', 'Ash',
                    'Alcalinity of ash', 'Magnesium', 'Total phenols',
                    'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins',
                    'Color intensity', 'Hue', 'OD280/OD315 of diluted wines',
                    'Proline'])
# Split dataset
X, y = df_wine.iloc[:, 1:].values, df_wine.iloc[:, 0].values
X, _, y, _ = train_test_split(X, y, test_size=0.3, random_state=0)
In [13]:
%%time
# Run primitive
rfc = feature_selection.RFFeatures(problem_type="classification",
                                   cv=3, scoring="accuracy", verbose=0, n_jobs=1)
rfc.fit(("matrix", "matrix"), (X, y))
indices = rfc.transform()
In [14]:
feature_importances = rfc.feature_importances
#feature_indices = rfc.indices
for i in range(len(columns)):
    print("{:>2}) {:^30} {:.5f}".format(i + 1,
                                        columns[indices[i]],
                                        feature_importances[indices[i]]))
plt.figure(figsize=(9, 5))
plt.title('Feature Importances')
plt.bar(range(len(columns)), feature_importances[indices], color='lightblue', align='center')
plt.xticks(range(len(columns)), columns[indices], rotation=90, fontsize=14)
plt.xlim([-1, len(columns)])
plt.tight_layout()
plt.savefig('./random_forest.png', dpi=300)
plt.show()
In [17]:
# Get boston dataset
boston = datasets.load_boston()
X, y = boston.data, boston.target
In [18]:
%%time
# Run primitive
rfr = feature_selection.RFFeatures(problem_type="regression",
                                   cv=3, scoring="r2", verbose=0, n_jobs=1)
rfr.fit(("matrix", "matrix"), (X, y))
indices = rfr.transform()
In [19]:
feature_importances = rfr.feature_importances
#feature_indices = rfr.indices
columns = boston.feature_names
for i in range(len(columns)):
    print("{:>2}) {:^15} {:.5f}".format(i + 1,
                                        columns[indices[i]],
                                        feature_importances[indices[i]]))
plt.figure(figsize=(9, 5))
plt.title('Feature Importances')
plt.bar(range(len(columns)), feature_importances[indices], color='lightblue', align='center')
plt.xticks(range(len(columns)), columns[indices], rotation=90, fontsize=14)
plt.xlim([-1, len(columns)])
plt.tight_layout()
plt.savefig('./random_forest.png', dpi=300)
plt.show()
To provide some context, below we show the correlation coefficients between some of the features in the Boston dataset. Notice how the two features that our primitive ranked as most important are also the two features with the highest correlation coefficient (in absolute value) with the dependent variable MEDV.
This figure was taken from the Python Machine Learning book.
In [15]:
import matplotlib.image as mpimg
img=mpimg.imread("heatmap.png")
plt.figure(figsize=(10, 10))
plt.axis("off")
plt.imshow(img);
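If heatmap.png is not available locally, a rough equivalent can be computed directly from the dataset. The sketch below is only illustrative: it assumes seaborn is installed, and the subset of columns is an arbitrary choice made for readability.
In [ ]:
# Sketch: correlation heatmap computed from the Boston data itself
# (assumes seaborn is available; the column subset is an arbitrary choice)
import seaborn as sns

boston_df = pd.DataFrame(boston.data, columns=boston.feature_names)
boston_df['MEDV'] = boston.target

cols = ['LSTAT', 'RM', 'NOX', 'INDUS', 'PTRATIO', 'MEDV']
corr = boston_df[cols].corr()

plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm', square=True)
plt.title('Correlation coefficients (subset of Boston features)')
plt.show()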
In [20]:
# Get iris dataset
iris = datasets.load_iris()
sc = StandardScaler()
X = sc.fit_transform(iris.data)
In [22]:
# run primitive
iris_ranking = feature_selection.PCAFeatures()
iris_ranking.fit(["matrix"], X)
importances = iris_ranking.transform()
In [25]:
feature_names = np.array(iris.feature_names)
print(feature_names, '\n')
for i in range(importances["importance_onallpcs"].shape[0]):
    print("{:>2}) {:^19}".format(i + 1, feature_names[iris_ranking.importance_onallpcs[i]]))
In [23]:
plt.figure(figsize=(9, 5))
plt.bar(range(1, 5), iris_ranking.explained_variance_ratio_, alpha=0.5, align='center')
plt.step(range(1, 5), np.cumsum(iris_ranking.explained_variance_ratio_), where='mid')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal components')
plt.xticks([1, 2, 3, 4])
plt.show()